
Quantitative Biology

Wiley

Preprints posted in the last 30 days, ranked by how well they match Quantitative Biology's content profile, based on 11 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Enhancing non-local interaction modeling for ab initio biomolecular calculations and simulations with ViSNet-PIMA

Cui, T.; Wang, Z.; Wang, T.

2026-03-20 bioinformatics 10.64898/2026.03.18.712561 medRxiv
Top 0.1%
3.7%

AI-based molecular dynamics simulation brings ab initio calculations to biomolecules in an efficient way, in which the machine learning force field (MLFF) plays the central role by accurately predicting molecular energies and forces. Most existing MLFFs assume localized interatomic interactions, limiting their ability to accurately model non-local interactions, which are crucial in biomolecular dynamics. In this study, we introduce ViSNet-PIMA, which efficiently learns non-local interactions through a physics-informed multipole aggregator (PIMA) and accurately encodes molecular geometric information. ViSNet-PIMA outperforms all state-of-the-art MLFFs for energy and force predictions of different kinds of biomolecules and various conformations on the MD22 and AIMD-Chig datasets, while adapting the PIMA blocks into other MLFFs further achieves 55.1% performance gains, demonstrating the superiority of ViSNet-PIMA and the universality of the model design. Furthermore, we propose AI2BMD-PIMA to incorporate ViSNet-PIMA into the AI2BMD simulation program by introducing a "Transfer Learning-Pretraining-Finetuning" scheme and replacing molecular mechanics-based non-local calculations among protein fragments with ViSNet-PIMA, which reduces AI2BMD's energy and force calculation errors by more than 50% for different protein conformations and protein folding and unfolding processes. ViSNet-PIMA advances ab initio calculation for entire biomolecules, amplifying the application value of AI-based molecular dynamics simulations and property calculations in biochemical research.

2
Noisy periodicity in tropical respiratory disease dynamics

Yang, F.; Hanks, E. M.; Conway, J. M.; Bjornstad, O. N.; Thanh, N. T. L.; Boni, M. F.; Servadio, J. L.

2026-04-13 epidemiology 10.64898/2026.04.10.26350660 medRxiv
Top 0.3%
1.5%

Infectious disease surveillance systems in tropical countries show that respiratory disease incidence generally manifests as year-round activity with weak fluctuations and irregular seasonality. Previously, using a ten-year time series of influenza-like illness (ILI) collected from outpatient clinics in Ho Chi Minh City (HCMC), Vietnam, we found a combination of nonannual and annual signals driving these dynamics, but with unknown mechanisms. In this study, we use seven stochastic dynamical models incorporating humidity, temperature, and school term to investigate plausible mechanisms behind these annual and nonannual incidence trends. We use iterated filtering to fit the models and evaluate the models by comparing how well they replicate the combination of annual and nonannual signals. We find that a model including specific humidity, temperature, and school term best fits our observed data from HCMC and partially reproduces the irregular seasonality. The estimated effects from specific humidity and temperature on transmission are nonlinearly negative but weak. School dismissal is associated with decreased transmission, but also with low magnitude. Under these weak external drivers, we hypothesize that stochasticity makes a strong sub-annual cycle more likely to be observed in ILI disease dynamics. Our study shows a possible mechanism for respiratory disease dynamics in the tropics. When the external drivers are weak, the seasonality of respiratory disease dynamics is prone to the influence of stochasticity.

3
Nonlinear mixed-effect models and tailored parametrization schemes enables integration of single cell and bulk data

Wang, D.; Froehlich, F.; Stapor, P.; Schaelte, Y.; Huth, M.; Eils, R.; Kallenberger, S.; Hasenauer, J.

2026-04-09 systems biology 10.64898/2026.04.06.716803 medRxiv
Top 0.4%
1.3%

Experimental methods for characterizing single cells and cell populations have improved tremendously over the past decades. This progress has enabled the development of quantitative, mechanistic models for cellular processes based on either single-cell or bulk data. However, coherent statistical frameworks for the model-based integration of different data types at the single-cell and population levels are still missing. In this work, we present a mathematical modeling approach for integrating single-cell time-lapse, single-cell snapshot, single-cell time-to-event and population-average data. Utilizing a formulation based on nonlinear mixed-effect modeling, we enable the description of multiple data types, with and without single-cell resolution, and we propose a tailored parameter estimation scheme that facilitates the assessment of underlying process parameters. Our study demonstrates that the proposed approach can reliably integrate diverse data types, thereby improving parameter identifiability and prediction accuracy. Applying this framework to extrinsic apoptosis reveals that simultaneously considering multiple data types can be essential, particularly when experimental constraints limit data availability. The proposed approach is broadly applicable and may significantly advance our understanding of complex biological processes.

4
CGRig: a rigid-body protein model with residue-level interaction sites for long-time and large-scale protein assembly simulation

Teshirogi, Y.; Terada, T.

2026-03-24 biophysics 10.64898/2026.03.21.713350 medRxiv
Top 0.4%
1.2%

Molecular dynamics (MD) simulations are a powerful tool for investigating biomolecular dynamics underlying biological functions. However, the accessible spatiotemporal scales of conventional all-atom simulations remain limited by high computational costs. Coarse-graining reduces these costs by decreasing the number of interaction sites and enabling longer timesteps. In extreme cases, proteins are represented as single spherical particles; while such approximations facilitate cellular-scale simulations, they often sacrifice essential structural information, such as molecular shape and interaction anisotropy. Here, we present CGRig, a rigid-body protein model with residue-level interaction sites designed for long-time, large-scale simulations. In CGRig, each protein is treated as a single rigid body embedding residue-level interaction sites. Its translational and rotational motions are described by the overdamped Langevin equation incorporating a shape-dependent friction matrix. Intermolecular interactions are calculated using Gō-like native contact potentials, Debye-Hückel electrostatics, and volume exclusion. We validated that CGRig accurately reproduces the translational and rotational diffusion coefficients expected from the friction matrix for an isolated protein. For dimeric systems, the model successfully maintained native complex structures. Furthermore, two initially separated proteins converged into the correct complex with an association rate consistent with all-atom simulations. Notably, CGRig achieved a simulation performance exceeding 17 s/day for a 1,024-molecule system. These results demonstrate that CGRig provides an efficient framework for simulating protein assembly while retaining residue-level interaction specificity, making it a valuable tool for investigating large-scale biomolecular self-assembly.

5
CASPULE: A computational tool to study sticker spacer polymer condensates

Chattaraj, A.; Kanovich, D. S.; Ranganathan, S.; Shakhnovich, E. I.

2026-03-20 biophysics 10.1101/2025.11.09.687447 medRxiv
Top 0.4%
1.1%

Phase-separated condensates are recognized as a ubiquitous mechanism of spatial organization in cell biology. Biophysical modeling of condensates provides critical insights into the dynamics and functions of these subcellular structures that are difficult to extract via experiments. Here we present an efficient computational pipeline, CASPULE (Condensate Analysis of Sticker Spacer Polymers Using the LAMMPS Engine), to simulate and analyze biological condensates made of sticker-spacer polymers. CASPULE implements a unique force field that combines traditional Langevin dynamics with a "detailed balance proof" protocol for single-valent bond formation between stickers. This framework allows us to study the non-trivial biophysics that emerges from the single-valent sticker interactions coupled with the effect of separation in energetic contribution by stickers and spacers. We provide detailed documentation on how to set up the simulation environment, perform simulations and analyze the results. Through case studies, we highlight the utility and efficacy of our pipeline. Importantly, we provide statistical parameters to characterize the cluster size distribution often observed in biological systems. We envision this tool to be broadly useful in decoding the interplay of kinetics and thermodynamics underlying the formation and function of biological condensates.

6
Educational Browser-Native SIR Simulation: Analytical Benchmarks Showing Numerical Accuracy for Lightweight Epidemic Modeling

Ben-Joseph, J.

2026-04-17 epidemiology 10.64898/2026.04.15.26350961 medRxiv
Top 0.5%
1.0%

Lightweight epidemic calculators are widely used for teaching and rapid scenario exploration, yet many omit the methodological detail needed for scientific reuse. We present a browser-native SIR calculator that exposes forward Euler and classical fourth-order Runge-Kutta (RK4) integration alongside epidemiologically interpretable outputs and a population-conservation diagnostic. The implementation is anchored to analytical properties of the deterministic SIR system, including the epidemic threshold, the peak condition, and the final-size relation. Benchmark experiments show that RK4 is essentially step-size invariant over practical discretizations, whereas Euler at a coarse one-day step overestimates peak prevalence by 3.97% and final size by 0.66% relative to a fine-step RK4 reference. These results demonstrate that browser-based tools can support publication-quality computational narratives when solver choice, diagnostics, and assumptions are treated as first-class outputs.
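
The solver comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the preprint's browser implementation: the function names, the parameter values (beta = 0.3/day, gamma = 0.1/day, initial infected fraction 0.001, 300 days), and the use of population fractions are all assumptions chosen for illustration, so the percentages it yields need not match those reported above.

```python
import math


def sir_rhs(s, i, r, beta, gamma):
    """Right-hand side of the deterministic SIR system (population fractions)."""
    new_infections = beta * s * i
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)


def euler_step(state, dt, beta, gamma):
    """One forward Euler step."""
    k1 = sir_rhs(*state, beta, gamma)
    return tuple(x + dt * dx for x, dx in zip(state, k1))


def rk4_step(state, dt, beta, gamma):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(x + h * dx for x, dx in zip(state, k))

    k1 = sir_rhs(*state, beta, gamma)
    k2 = sir_rhs(*shifted(k1, dt / 2), beta, gamma)
    k3 = sir_rhs(*shifted(k2, dt / 2), beta, gamma)
    k4 = sir_rhs(*shifted(k3, dt), beta, gamma)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(step, dt, t_end, beta, gamma, i0):
    """Integrate from (1 - i0, i0, 0); report peak, final size, and S+I+R drift."""
    state = (1.0 - i0, i0, 0.0)
    peak = state[1]
    for _ in range(round(t_end / dt)):
        state = step(state, dt, beta, gamma)
        peak = max(peak, state[1])
    return {
        "peak": peak,
        "final_size": state[2],
        "conservation_error": abs(sum(state) - 1.0),
    }


def analytical_final_size(r0, s0):
    """Solve z = 1 - s0 * exp(-r0 * z) by bisection (final-size relation, R(0) = 0)."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid - 1.0 + s0 * math.exp(-r0 * mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative (assumed) parameters: R0 = beta / gamma = 3.
BETA, GAMMA, I0, T_END = 0.3, 0.1, 1e-3, 300.0
reference = simulate(rk4_step, 0.1, T_END, BETA, GAMMA, I0)
for name, step in (("Euler", euler_step), ("RK4", rk4_step)):
    run = simulate(step, 1.0, T_END, BETA, GAMMA, I0)
    rel_peak = 100 * (run["peak"] - reference["peak"]) / reference["peak"]
    print(f"{name} dt=1: peak error {rel_peak:+.2f}% vs fine-step RK4")
```

Comparing the fine-step RK4 final size against `analytical_final_size(BETA / GAMMA, 1 - I0)` gives an independent check of the kind the abstract calls an analytical benchmark, and `conservation_error` plays the role of the population-conservation diagnostic.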

7
OpenCafeMol with 3SPN.2 DNA model: GPU Acceleration for Long-Time Coarse-Grained Chromatin Simulations

Yamauchi, M.; Murata, Y.; Niina, T.; Takada, S.

2026-03-19 biophysics 10.64898/2026.03.18.712524 medRxiv
Top 0.5%
0.9%

There is a growing demand for molecular dynamics simulations to explore longer timescale behavior of giant protein-DNA complexes such as chromatin. To address this need, we extended OpenCafeMol, a GPU-accelerated residue-level coarse-grained molecular dynamics simulator originally developed for proteins and lipids, to support 3SPN.2 and 3SPN.2C DNA models. We also implemented a hydrogen-bond-type many-body potential to model DNA-protein interactions more accurately. To further improve computational efficiency, we introduced a localized scheme for calculating base-pairing and cross-stacking interactions. Benchmark tests show that OpenCafeMol on a single GPU achieves up to 200-fold speed-up for DNA-only systems and up to 100-fold speed-up for DNA-protein complexes compared to CPU-based simulations. To demonstrate the capability of our implementation for long-timescale biological processes, we simulated an archaeal SMC-ScpA complex undergoing DNA translocation via segment capture (a proposed mechanism for DNA loop extrusion) in the presence of a DNA-bound obstacle. We observed continuous captured-loop growth accompanied by obstacle bypass within the segment capture framework.

8
Triangular Invariant Sets for Containment of Drug Resistance Under Evolutionary Therapy

Hernandez Vargas, E. A.

2026-03-27 evolutionary biology 10.64898/2026.03.26.714636 medRxiv
Top 0.6%
0.9%

Evolutionary therapies regulate heterogeneous populations by altering selective pressures through treatment sequences in cancer and infections. This letter develops an invariant-set framework for treatment-induced containment based on positive triangular invariant sets. For periodically switched systems, sufficient conditions are derived for the existence of such invariant regions. Robustness with respect to mutation is established by showing that the invariant simplex persists under small perturbations of the subsystem matrices. In the two-phenotype case, the analysis yields an explicit mutation threshold that separates regimes in which therapy cycling maintains containment from regimes in which mutation can enable evolutionary escape. Simulations illustrate the geometry of the invariant sets and the role of mutation and dwell time in containment robustness.

9
Time-dependent memory of hypoxia exposure influences tumor invasion dynamics

Sadhu, G.; Jain, P.; Meena, R. K.; George, J. T.; Jolly, M. K.

2026-04-09 systems biology 10.64898/2026.04.07.716866 medRxiv
Top 0.6%
0.8%

Cancer cells in hypoxic environments often proliferate less but exhibit enhanced migration relative to their normoxic counterparts. Recent in vitro and in silico studies have characterized the role of hypoxic memory - the ability of cancer cells to retain their hypoxic phenotype even when reoxygenated - in tumor invasion. However, the observations have been limited either to exposing cancer cells to hypoxia for a fixed duration or by assuming a fixed-time persistence of the hypoxic state upon reoxygenation independent of the duration of hypoxia exposure. Thus, time-dependent cell-state changes during hypoxia and their impact on hypoxic memory remain unclear. Here, we first analyze transcriptomic data from breast cancer samples to show that the genes upregulated at the transcriptional level and hypomethylated at the epigenetic level are enriched in cell invasion, indicating a hypoxic memory-driven process of tumor invasion. Next, we used a computational model to investigate how the spatiotemporal dynamics of oxygen levels in a tumor drive time-dependent changes in hypoxic memory and influence tumor invasion dynamics. Our simulation results show that such dynamic hypoxic memory can drive enhanced tumor invasion over a fixed hypoxic memory by a) enriching hypoxic cell density at the tumor front, b) reducing sensitivity of hypoxic cell state to fluctuations in oxygen supply, and c) enhancing effective diffusion of hypoxic cells. Our results highlight the crucial role of dynamic hypoxic memory in shaping tumor invasion dynamics, underscoring the need to elucidate its underlying mechanisms in future studies.

10
PyrMol: A Knowledge-Structured Pyramid Graph Framework for Generalizable Molecular Property Prediction

Li, Y.; Zhao, Q.; Wang, J.

2026-03-20 bioinformatics 10.1101/2025.11.09.686426 medRxiv
Top 0.6%
0.8%

Expert pharmaceutical chemists interpret molecular structures through a sophisticated cognitive hierarchy, transitioning from local functional moieties to spatial pharmacophores and, ultimately, to macroscopic pharmacological and physicochemical profiles. However, conventional Graph Neural Networks frequently overlook this high-level chemical intuition by treating molecules as single-scale atomic topology. To bridge this gap between human expertise and computational inference, we propose PyrMol, a knowledge-structured pyramid representation learning framework. By constructing heterogeneous hierarchical graphs, PyrMol orchestrates information flow across atomic, subgraph, and molecular levels. Crucially, the subgraph level systematically integrates three complementary expert views comprising functional groups, pharmacophores, and retrosynthetic fragments. To harmonize these explicit domain priors with implicit computational semantics, we introduce an adaptive Multi-source Knowledge Enhancement and Fusion module that dynamically balances their complementarity and redundancy. A Hierarchical Contrastive Learning strategy further ensures cross-scale semantic consistency. Empirical evaluations across ten benchmark datasets demonstrate that PyrMol outperforms 12 state-of-the-art baselines. Furthermore, its "plug-and-play" versatility provides a framework-agnostic performance boost for existing GNN architectures. PyrMol thus establishes a principled data-knowledge dual-driven paradigm for AI-aided Drug Discovery, effectively leveraging domain knowledge to catalyze advances in molecular property prediction.

11
Transmission dynamics of the COVID-19 pandemic across the emerging variants in mainland China: a hypergraph-based spatiotemporal modeling study

Wang, Y.; WANG, D.; Lau, Y. C.; Du, Z.; Cowling, B. J.; Zhao, Y.; Ali, S. T.

2026-04-17 public and global health 10.64898/2026.04.16.26351004 medRxiv
Top 0.7%
0.8%

Mainland China experienced multiple waves of the COVID-19 pandemic during 2020-2022, driven by emerging variants and changes in public health and social measures (PHSMs). We developed a hypergraph-based Susceptible-Vaccinated-Exposed-Infectious-Recovered-Susceptible (SVEIRS) model to reconstruct epidemic dynamics across 31 provinces, capturing transmission heterogeneity associated with clustered contacts. We assessed key characteristics of transmission at national and provincial levels during four outbreak periods: initial, localized pre-Delta, Delta, and widespread Omicron, which accounted for 96.7% of all infections. We found significant diversity in transmission contributions across cluster sizes, with a small fraction of larger clusters responsible for a disproportionate share of infections. Counterfactual analyses showed that reducing cluster-size heterogeneity, while holding overall exposure constant, could have lowered national infections by 11.70 to 30.79%, with the largest effects during the Omicron period. Ascertainment rates increased over time but remained spatially heterogeneous, ranging from 14.40% to 71.93%. Population susceptibility declined following mass vaccination (to 42.49% in Aug 2021, nationally) and rebounded (to 89.89% in Nov 2022) due to waning immunity, with variations across the provinces. Effective reproduction numbers displayed marked temporal and spatial variability, with higher estimates during Omicron. Overall, these results highlight the critical role of group contact heterogeneity in shaping epidemic dynamics.

12
Fine-grained spatial data-driven ensemble modeling for predicting Sylvatic Yellow Fever environmental suitability in Brazil

Augusto, D. A.; Abdalla, L.; Krempser, E.; de Oliveira Passos, P. H.; Garkauskas Ramos, D.; Pecego Martins Romano, A.; Chame, M.

2026-04-01 epidemiology 10.64898/2026.03.26.26349443 medRxiv
Top 0.7%
0.8%

Sylvatic Yellow Fever (YF) is an infectious mosquito-borne disease with significant epidemiological relevance due to its widespread distribution and high lethality for humans and non-human primates, particularly in tropical regions of the planet such as Brazil. Identifying regions and periods of high environmental suitability for the occurrence of YF is essential for preventing or mitigating its burden, as it enables the efficient allocation of surveillance efforts, prevention, and implementation of control measures. Environmental modeling of YF occurrence has proven to be an effective approach toward this goal; however, its effectiveness strongly depends on the modeling framework's capabilities as well as the spatial and temporal precision of all associated data. We propose a fine-scale geospatial modeling of YF environmental suitability that is based on a generative machine-learning ensemble method built on a large set of high-resolution environmental covariates. First, we take the spatiotemporal statistical description of the environment of each of the 545 YF cases from 2019-2024 up to 30 m/monthly resolution at three buffer scales: 100 m, 500 m, and 1000 m radii. Then, we perform a feature selection and train hundreds of One-Class Support Vector Machine submodels to form a robust ensemble model, whose predictions are projected to a 1x1 km resolution grid of Brazil under several metrics, exceeding seven million ensemble evaluations. The predictions ranked the Southern Brazil region with the highest mean suitability for YF, with a level of 0.64; Southeast comes next with 0.46, followed closely by the Central-West region (0.44), North (0.39), and finally Northeast (0.28). The model exhibited high uncertainty for the North region, indicating that data collection efforts are much needed in this region. As for the environmental covariates, a feature analysis pointed out that land use and cover accounts for the largest influence on the model output.

13
Systems analysis of ribosomal CAR-site dynamics

Perez, L.; Iradukunda, M.; Krizanc, D.; Thayer, K.; Weir, M. P.

2026-03-31 systems biology 10.64898/2026.03.28.714829 medRxiv
Top 0.8%
0.7%

Developing approaches to link structure and function is an ongoing challenge in computational and structural biology. Using a systems-level framework, we present here an analysis pipeline in a Python package, mdsa-tools, that constructs network representations of structures in a time series of trajectory frames from molecular dynamics (MD) simulations. Here, we demonstrate its use on a ribosomal subsystem. The subsystem is centered on the CAR interaction surface, a "brake pad" adjacent to the aminoacyl (A-site) decoding center that tunes protein translation rates. We leverage unsupervised learning algorithms to explore the conformational landscape of behaviors visited by two versions of the subsystem (brake-on and brake-off) that differ at the codon 3' adjacent to the A-site codon. Our network representations of MD frames embody H-bond interactions between all pairwise combinations of residues in the system. By utilizing per-frame vector representations of network edges, we can apply standard clustering and dimensionality reduction methods to explore behavioral differences between the brake-on and brake-off versions of the system. K-means clustering of frame vectors revealed a striking separation of the two system versions, consistent with principal components analysis (PCA) embeddings and Uniform Manifold Approximation and Projection (UMAP) embeddings. Dissection of K-means centroids and PCA loadings highlighted H-bond interactions between residue pairs in the ribosome's peptidyl site (P site), suggesting potential allosteric signaling across the subsystem.

Author summary: With the impressive development of computational algorithms to successfully simulate the dynamics of biological molecules over time, the exploration and incorporation of systems modes of analysis is a natural next step to begin to understand the molecular dynamics behaviors that emerge from these experiments. Following the approaches of classical molecular genetics, we used a "computational genetics" paradigm where we introduced changes (mutations) in potentially important residues, changing their identities or modifying their chemical properties, and asked how the dynamic system responded to these changes, viewing the simulations as a series of movie frames of the dynamic structure over time. Starting with network representations of each frame's structure, where the nodes are residues, and the edges denote H-bond interactions between the residues, we used several unsupervised machine learning algorithms to uncover behavioral changes in the different mutated versions of the system. Applied to our ribosome neighborhood, this revealed unexpected changes in behavior at the ribosome peptidyl site (P site) in response to mutating mRNA residues on the other side of the aminoacyl site (A site) codon, suggesting long-range allosteric interactions across the neighborhood.

14
TCMCard: A High-Confidence Digital Infrastructure for Traditional Chinese Medicine Quantified by Multi-Dimensional Evidence Integration

Wang, Y.; Dong, W.; Yao, J.; Wang, K.; Zhang, L.; Wang, Y.; Guo, S.; Li, H.; Cai, H.; Wang, X.; Li, Y.

2026-04-10 bioinformatics 10.64898/2026.04.07.716940 medRxiv
Top 0.9%
0.7%

Network pharmacology has become a widely used approach for deciphering multi-component, multi-target mechanisms of traditional Chinese medicine (TCM). Here we introduce TCMCard, a high-confidence digital infrastructure built on a Multi-Dimensional Evidence Integration (MDEI) framework. The framework integrates experimental activity data from authoritative chemical databases, literature-derived evidence, and structure-based similarity inference. Preprocessing steps include chemical structure normalization, species-specific filtering, and target quality scoring. Applied to conventional interaction datasets, this pipeline leads to the removal of over 60% of low-confidence noise. TCMCard supports network pharmacology exploration through an interactive visualization platform, and module analysis identifies functionally relevant communities that offer insights into the synergistic actions of TCM formulas. Overall, TCMCard may help move the field beyond simple data aggregation toward evidence-informed curation and quality-driven analysis. As an interactive and publicly accessible platform, it reveals an organized backbone within complex interaction networks, offering a more reliable basis for understanding multi-component synergy in TCM.

15
BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays

Slenders, E.; Perego, E.; Zappone, S.; Vicidomini, G.

2026-04-10 bioinformatics 10.64898/2026.04.08.717207 medRxiv
Top 1.0%
0.6%

Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing the high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables transition from the GUI to Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.

16
SSPSPredictor: A Sequence and Structure based Deep Learning Model for Predicting Phase-Separating Proteins

Wang, T.; Liao, S.; Qi, Y.; Zhang, Z.

2026-04-01 bioinformatics 10.64898/2026.03.30.715224 medRxiv
Top 1.0%
0.5%

Liquid-liquid phase separation (LLPS) underlies the formation of biomolecular liquid condensates (also referred to as membraneless organelles, MLOs), which are essential for spatially organizing various biochemical processes within cells. Proteins that play a key role in driving condensate formation are termed phase-separating proteins (PSPs). Given that experimental identification of PSPs remains labor-intensive and time-consuming, multiple computational tools have been developed based on empirical features or deep learning. In this study, we propose SSPSPredictor, a novel multimodal predictive model for PSPs with folded or intrinsically disordered structures, leveraging the fusion of sequence information from the protein language model ESM-2 and structural insights from the graph neural network GVP. Compared with existing tools, SSPSPredictor achieves balanced performance in identifying endogenous PSPs, predicting relative LLPS propensities, and recognizing key regions that drive LLPS. Moreover, SSPSPredictor exhibits good interpretability in identifying driving regions along protein sequences, although no relevant supervision was provided during training. Further predictive analysis of the human proteome using SSPSPredictor reveals that the proportion of intrinsically disordered proteins (IDPs) undergoing LLPS is significantly higher than that of folded proteins. In addition, pathogenic variants, especially those located in disordered regions, exhibit higher LLPS propensity than other mutations, uncovering a link between LLPS and diseases at the amino acid level.

17
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 1%
0.5%

The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often have characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily for comparison with regard to variable selection capability, and secondarily for survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performance was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
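
For readers unfamiliar with the concordance index named among the evaluation metrics, the following Python sketch shows one common definition (Harrell's C) for right-censored data. It is an illustration only: the abstract does not say which variant the authors used, the function name and toy data are hypothetical, and pairs with tied event times are simply skipped here as a simplifying assumption.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (illustrative sketch).

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if the subject was censored
    risk_scores -- model output; a higher score should mean an earlier event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for a in range(n):
        for b in range(n):
            # A pair is comparable only when the subject with the strictly
            # shorter time actually experienced the event (was not censored).
            if events[a] and times[a] < times[b]:
                comparable += 1
                if risk_scores[a] > risk_scores[b]:
                    concordant += 1.0  # model ranks the earlier event as riskier
                elif risk_scores[a] == risk_scores[b]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; concordance is undefined")
    return concordant / comparable


# Toy data (made up for illustration): the third subject is censored at t = 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
print(concordance_index(toy_times, toy_events, [0.9, 0.7, 0.5, 0.1]))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and censored subjects contribute only as the later member of a pair.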

18
AI in Practice: A Multilingual Survey of 2025 BioHackathon Participants

Sriwichai, N.; Feriau, L.; Tongyoo, P.; Noda, Y.; Gyoji, H.; Noisagul, P.; Goto, S.; Steinberg, D.; Wangsanuwat, C.

2026-03-27 scientific communication and education 10.64898/2026.03.25.713611 medRxiv
Top 1%
0.5%

This dataset arises from a multilingual survey of AI use among participants and community members in the DBCLS BioHackathon 2025 in Japan. The questionnaire, offered in English, Japanese, and Thai, asked how often respondents use AI tools, what they use them for, obstacles they encounter, institutional support, satisfaction, and concerns. Additional items captured role, institution type, work country, and other demographics, totaling 105 responses. The dataset includes both raw anonymized responses and a cleaned, standardized English-only version suitable for quantitative analysis, along with the full questionnaire, a data dictionary for the cleaned dataset, and a translation lookup table. Free-text answers were screened and redacted to remove URLs, names, and other potentially identifiable information. Together, these materials provide a community-level view of AI practice in genomics, bioinformatics, software development, and related areas, and can support work on AI adoption, policy, and methods for analyzing survey data on AI use in science.

19
Simulation of neurotransmitter release and its imaging by fluorescent sensors

Gretz, J.; Mohr, J. M.; Hill, B. F.; Andreeva, V.; Erpenbeck, L.; Kruss, S.

2026-03-25 neuroscience 10.64898/2026.03.23.707923 medRxiv
Top 1%
0.4%

Cells release signaling molecules such as neurotransmitters that diffuse through the extracellular space and bind to receptors. These signaling molecules can be detected by fluorescent sensors or probes to provide images of the signaling process. Such images are not equivalent to a concentration map because diffusion and sensor kinetics are convolved into them. Computational approaches are therefore necessary to disentangle these contributions and allow interpretation of fluorescent sensor-based images. Here, we present a kinetic Monte Carlo framework (FLuorescence Imaging Kinetic Simulation, FLIKS) that simulates signaling molecules undergoing cellular release, stochastic diffusion, and reversible binding to sensors in realistic 2D or 3D cellular geometries. We apply it to model neurotransmitter (dopamine) release in synaptic clefts and paracrine signaling by immune cells. We also show how sensor location, sensor kinetics, and release location affect fluorescence images. For example, we show how sensor sensitivity depends on the distance from the synaptic cleft and changes when dopamine transporters (DAT) clear dopamine. The approach also allows comparison of the performance of membrane-bound (genetically encoded) sensors with that of artificial sensors such as nanosensors placed outside, under, or around the cells. As a further example, we demonstrate how images of catecholamine release by immune cells can be modeled and compared to experimental data to better understand the release pattern. This framework provides a quantitative basis for analyzing and interpreting fluorescent sensor imaging data.
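
The interplay of release, diffusion, and reversible sensor binding described in this abstract can be sketched with a toy simulation. The code below is not FLIKS and does not reproduce its algorithm; it is a simplified fixed-time-step Monte Carlo in 2D, with one circular sensor region of unlimited capacity. The function name `simulate_release` and all parameter values (diffusion coefficient, sensor position and radius, binding and unbinding probabilities) are illustrative assumptions.

```python
import math
import random


def simulate_release(n_molecules=200, n_steps=500, dt=1e-6, diffusion=1e-10,
                     sensor_xy=(0.5e-6, 0.0), sensor_radius=5e-8,
                     p_bind=0.5, p_unbind=0.01, seed=0):
    """Toy 2D model: point release at the origin, Brownian steps,
    reversible binding inside a circular sensor region.

    Returns the number of bound molecules per time step, used here as a
    stand-in for the fluorescence signal (an assumption).
    """
    rng = random.Random(seed)
    # Standard deviation of one Brownian step along each axis.
    sigma = math.sqrt(2.0 * diffusion * dt)
    positions = [[0.0, 0.0] for _ in range(n_molecules)]
    bound = [False] * n_molecules
    trace = []
    for _ in range(n_steps):
        for k in range(n_molecules):
            if bound[k]:
                # Bound molecules stay in place and may unbind.
                if rng.random() < p_unbind:
                    bound[k] = False
                continue
            # Free molecules take a random diffusive step.
            positions[k][0] += rng.gauss(0.0, sigma)
            positions[k][1] += rng.gauss(0.0, sigma)
            dx = positions[k][0] - sensor_xy[0]
            dy = positions[k][1] - sensor_xy[1]
            # Molecules inside the sensor region may bind.
            if math.hypot(dx, dy) < sensor_radius and rng.random() < p_bind:
                bound[k] = True
        trace.append(sum(bound))
    return trace


trace = simulate_release()
print(max(trace), trace[-1])  # peak and final number of bound molecules
```

Moving `sensor_xy` farther from the origin, or changing `p_unbind`, shows in this toy setting how sensor placement and kinetics shape the recorded trace independently of the underlying release event.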

20
The results of Transcriptome-wide Mendelian Randomization (TWMR) in large-scale populations can directly validate, across scales, the results of causal inference from deep learning combined with double machine learning on single-cell transcriptomes of human samples.

Ye, W.; Jiang, X.; Shen, F.

2026-03-19 rheumatology 10.64898/2026.03.16.26348532 medRxiv
Top 1%
0.4%

Objective: Biomedical research faces several core problems, including the "translational distance", the difficulty of aligning cross-scale studies, and the lack of direct validation of single-cell systems biology models in human samples. This study aims to verify whether the results of transcriptome-wide Mendelian randomization (TWMR) based on large-scale populations are consistent with the causal inference results of deep learning combined with double machine learning (DML) using single-cell transcriptome data from human samples, to clarify whether statistical biology and systems biology can converge on the same biological truth, and to provide methodological support for mechanism dissection and precision medicine research in complex diseases such as rheumatoid arthritis (RA). Methods: This study integrated multi-omics data to conduct a two-stage causal inference and cross-scale validation analysis. In the first stage, a two-sample Mendelian randomization approach was applied to summary statistics from an RA genome-wide association study (GWAS) of 456,348 individuals of European ancestry in the UK Biobank (UKB) and to cis-expression quantitative trait locus (cis-eQTL) data from 31,684 individuals in the eQTLGen Consortium. Transcriptome-wide causal effect analysis was performed using the inverse-variance weighted (IVW) method, MR-Egger regression, and the weighted median method, and gene-level causal effect values were obtained after strict quality control and multiple testing correction.
In the second stage, single-cell RNA sequencing (scRNA-seq) data from RA patients and healthy controls were used (RA group: 11 samples, 211,867 cells; healthy control group: 38 samples, 456,631 cells). After preprocessing with the Seurat pipeline, batch effect correction, and cell type annotation, a hierarchical deep neural network was constructed to compress the high-dimensional expression data into features, and the DML framework was used to estimate the causal effects of genes on RA disease status. Finally, Pearson correlation analysis was used for cell type-specific cross-scale validation of the gene-level causal effect values obtained by the two methods, and the validated model was used to quantify the causal effects of 16 RA-related pathways from the Reactome database. Results: The gene causal effect values obtained from large-scale population TWMR analysis were significantly correlated with those calculated by the deep learning combined with DML model based on single-cell transcriptome data. The correlation was extremely significant (p<0.001) in core naive B cells (r=0.202, p=3.2e-05, n=414) and significant (p<0.05) in core naive CD4 T cells (r=0.102, p=0.037, n=412). The validated DML model successfully quantified the cell type-specific causal effect values of 16 RA-related signaling pathways. Conclusion: Statistical biology and systems biology can converge on the same biological truth. The cross-scale consistency between the two can significantly shorten the "translational distance" in biomedical research, and it enables direct validation of single-cell systems biology causal models of human samples against large-scale population genetic data, removing the excessive dependence on animal and cell experimental models in traditional research.
This research paradigm not only provides a new path for mechanism dissection and therapeutic target screening of complex diseases such as RA, but also provides a feasible solution for rare disease research to break through the limitation of GWAS sample size, and lays an important theoretical and methodological foundation for constructing standardized systems biology models of human complex diseases and promoting the development of precision medicine.
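
The inverse-variance weighted (IVW) method named in the Methods can be illustrated with a short sketch. The code below is not the authors' pipeline; it is the standard fixed-effect IVW estimator computed from per-variant summary statistics, and the function name `ivw_estimate` and the toy numbers are assumptions made for illustration.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted Mendelian randomization estimate.

    beta_exposure : per-variant effects on the exposure (e.g. gene expression)
    beta_outcome  : per-variant effects on the outcome (e.g. disease risk)
    se_outcome    : standard errors of the outcome effects
    Returns (causal estimate, standard error).
    """
    # Each variant's Wald ratio by/bx is weighted by bx^2 / se_y^2.
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = math.sqrt(1.0 / total)
    return estimate, std_error


# Toy summary statistics for three hypothetical instruments.
est, se = ivw_estimate([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.01, 0.01])
print(est, se)
```

In this toy input every variant has the same Wald ratio, so the pooled estimate equals that ratio; with real data the weights favor variants with stronger exposure effects and more precise outcome estimates.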